Lightweight Structured Text Processing
نویسندگان
چکیده
Text is a popular storage and distribution format for information, partly due to generic text-processing tools like Unix grep and sort. Unfortunately, existing generic tools make assumptions about text format (e.g., each line is a record) that limit their applicability. Custom-built tools are one alternative, but they require substantial time investment and programming expertise. We describe a new approach, lightweight structured text processing, which overcomes these difficulties by enabling users to define text structure interactively and manipulate the structure with generic tools. Our prototype system, LAPIS, is a web browser that can highlight, filter, and sort text regions described by the user. LAPIS has several advantages over other systems: (1) the ability to define custom structure with a simple, intuitive pattern language; (2) interactive specification, showing pattern matches in context and letting users choose the most convenient combination of manual selection and pattern matching; and (3) external parsers for standard text formats. The pattern language in LAPIS, text constraints, describes text structure in high-level terms, with region relationships like before, after, in, and contains. We describe an implementation of text constraints using a novel, compact representation of region sets as collections of rectangles, or region intervals. We also illustrate some examples of applying LAPIS to web pages, text files, and source code. Appeared in Proceedings of USENIX 1999 Annual Technical Conference, Monterey, CA, June 1999, pp 131–144. Outstanding paper award.
منابع مشابه
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
متن کاملOn Lightweight Data Summaries for Optimised Query Processing over Linked Data
Typical approaches for querying structured Web Data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. This time-consuming pre-processing phase however leverages the benefits of Linked Data – where structured data is accessible live and up-to-date at distributed Web resources that may change constantly – only to a limi...
متن کاملEffects of Structured Input and Meaningful Output on EFL Learners' Acquisition of Nominal Clauses
The current second language (L2) instruction research has raised great motivation for the use of both processing instruction and meaningful output instruction tasks in L2 classrooms as the two focus-on-form (FonF) instructional tasks. The present study investigated the effect of structured input tasks (represented by referential and affective tasks) compared with meaningful output tasks (implem...
متن کاملAnswering Definition Questions via Temporally-Anchored Text Snippets
A lightweight extraction method derives text snippets associated to dates from the Web. The snippets are organized dynamically into answers to definition questions. Experiments on standard test question sets show that temporally-anchored text snippets allow for efficiently answering definition questions at accuracy levels comparable to the best systems, without any need for complex lexical reso...
متن کاملAssignment of ontology-based broad semantic classes to biomedical text
Natural language processing of biomedical text benefits from the ability to recognize broad semantic classes, but the number of semantic types is far bigger than is usually treated in newswire text. A method for broad semantic class assignment using lightweight linguistic analysis is described and evaluated using traditional and novel methods.
متن کامل